Extracting directed information flow networks: an application to genetics and semantics



A. P. Masucci 1 , A. Kalampokis 2 , V. M. Eguíluz 1 , and E. Hernández-García 1 
1- Instituto de Física Interdisciplinar y Sistemas Complejos IFISC (CSIC-UIB), E-07122 Palma de Mallorca, Spain and 
2- Department of Marine Sciences, University of the Aegean, 811 00 Mytilene, Lesvos, Greece 

(Dated: December 30, 2010) 

We introduce a general method to infer the directional information flow between populations 
whose elements are described by n-dimensional vectors of symbolic attributes. The method is based 
on the Jensen-Shannon divergence and on the Shannon entropy and has a wide range of applications. 
We show here the results of two applications: first we extract the network of genetic flow between 
the meadows of the seagrass Posidonia oceanica, where the meadow elements are specified by sets 
of microsatellite markers; then we extract the semantic flow network from a set of Wikipedia pages, 
showing the semantic channels between different areas of knowledge. 
I. INTRODUCTION 

In recent years [?, ?] network theory has become a hot topic 
among scientists. Its applicability to many fields of science 
has made network theory a primary tool for understanding a wide 
range of phenomena, both in pure science and in policy 
management [?]. In particular, networks can be used whenever we 
want to understand the morphology and the topology of a complex 
system of interacting elements. Such a topology can then be 
useful to predict and to understand the complex behaviours of 
those systems. This kind of analysis is particularly relevant 
in the study of infection spreading, both in society and in the 
WWW [4], or for understanding the formation and modification of 
urban conglomerates. 

Most of the time the vertices of a network are single elements, 
such as individuals in social networks, tokens in linguistic 
networks, proteins in metabolic networks [?], etc. Nevertheless 
we can consider networks where the vertices are not single 
elements but ensembles of elements, such as populations of a 
metapopulation [?], and the links are given by some relations 
between those ensembles. 

In particular, suppose that we have a set of N populations, 
where each element of a population is described by an 
n-dimensional vector of symbolic or numeric attributes. Those 
attributes can be a set of social or ethnic indicators in the 
case of social systems (age, qualification level, salary, 
etc.), a set of genetic markers for biological populations, 
etc. Then each population can be represented by its size and by 
its probability distribution in the n-dimensional attribute 
space, where the probability distribution gives the probability 
that an element of the population is characterised by a given 
vector in the attribute space. Thus we can consider the network 
where each population is a vertex and two vertices are linked 
whenever an information flow between the two corresponding 
probability distributions is detected. 
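
As a minimal sketch of this representation, the snippet below turns a population, given as a list of attribute vectors, into its empirical probability distribution and its size. The function name `empirical_distribution` and the toy data are illustrative assumptions, not part of the original method description.

```python
from collections import Counter

def empirical_distribution(population):
    """Return (distribution, size) for a list of hashable attribute vectors.

    The distribution maps each distinct vector to its relative frequency.
    """
    size = len(population)
    counts = Counter(population)
    return {vector: c / size for vector, c in counts.items()}, size

# Hypothetical toy population with 2-dimensional symbolic attributes.
toy = [("a", 1), ("a", 1), ("b", 2), ("a", 3)]
dist, n = empirical_distribution(toy)
```

Each population is then fully described by the pair `(dist, n)`, which is all the later divergence measures need.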

We use the term information flow, instead of correlation or 
distance, because we want to emphasise that in the systems we 
consider the correlations often arise from migration or 
inheritance of elements between different populations. The 
movement of elements from one population to another then 
corresponds to a movement of attributes, or a flow of 
information, in the attribute space where those populations are 
defined. 

In general we can say that there is an information flow 
between two attribute distributions if the distributions 
are correlated and a direction for the informational in- 
teraction can be inferred. 

As an example we can imagine how geographical segregation [9] 
acts in a large city with a large social or ethnic diversity, 
such as New York, London, or Paris. In those cities more or 
less closed communities based on social or ethnic diversity 
form. We could then be interested in how those communities 
interact with each other and in the topology of the 
interactions between them. To understand those interactions we 
can consider some attributes of interest that characterise the 
elements of the different communities, such as wealth, habits, 
food consumed, etc., and measure the spread of those attributes 
between them. In other words we can estimate the information 
shared between the different communities of a given sample and 
establish a link between two communities whenever we recognise 
an information flow between them. The network of those 
interactions can thus give us valuable information about the 
evolution of the mesoscopic systems defined by the different 
urban areas inside the city macrosystem. 

In this paper we present a novel methodology based on the 
Jensen-Shannon divergence [10] and the Shannon entropy [11] to 
extract a directed network of information flows out of a set of 
populations, where the population elements are specified by 
n-dimensional vectors of symbolic attributes. As we already 
mentioned, this methodology has a wide range of applications, 
from social physics to economics and biology. We show here an 
application to genetics, where we measure the genetic flow 
between meadows of Posidonia oceanica, a Mediterranean 
seagrass, and an application to semantics, where we measure the 
semantic flow between different pages of the on-line 
encyclopedia Wikipedia. In the first case we show that the 
clusters of the resulting genetic network properly represent 
the different geographical locations of the meadows, while in 
the latter case we show that different entries correctly 
cluster in appropriate semantic categories, giving hints for 
interesting semantic speculations. 



II. INFORMATION FLOW BETWEEN 
POPULATIONS DEFINED IN A SYMBOLIC 
ATTRIBUTE SPACE 

In the literature there are different ways to compare 
probability distributions [12]. A convenient one for the kind 
of systems we want to study is the Jensen-Shannon divergence 
(JSD hereafter) [10]. As we explain in more detail below, we 
choose it because it is framed in information theory, it takes 
into account the different sizes of the populations, and the 
probability distributions do not have to be absolutely 
continuous with respect to each other [13]. 

Given two probability distributions $P = \{p_1, p_2, \ldots\}$ 
and $Q = \{q_1, q_2, \ldots\}$ of a discrete random variable, the 
JSD between P and Q is defined as: 

$$\mathrm{JSD}(P\|Q) = H(\pi_1 P + \pi_2 Q) - \pi_1 H(P) - \pi_2 H(Q) \qquad (1)$$

where $\pi_i$ are weights, that is $\pi_1 + \pi_2 = 1$, and 
$H(P) = -\sum_i p_i \ln p_i$ is the Shannon entropy measured in nats. 
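
Eq. (1) can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions: distributions are stored as dictionaries mapping attributes to probabilities, and the names `shannon_entropy` and `jsd` are hypothetical helpers introduced here.

```python
import math

def shannon_entropy(dist):
    """Shannon entropy in nats of a dict {attribute: probability}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def jsd(p, q, pi_1=0.5):
    """Weighted Jensen-Shannon divergence of Eq. (1); pi_2 = 1 - pi_1."""
    pi_2 = 1.0 - pi_1
    # The mixture lives on the union of the two supports, so neither
    # distribution has to be defined wherever the other one is.
    support = set(p) | set(q)
    mixture = {a: pi_1 * p.get(a, 0.0) + pi_2 * q.get(a, 0.0) for a in support}
    return (shannon_entropy(mixture)
            - pi_1 * shannon_entropy(p)
            - pi_2 * shannon_entropy(q))

same = jsd({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5})   # identical distributions
apart = jsd({"a": 1.0}, {"b": 1.0})                       # disjoint supports
```

With equal weights, identical distributions give a divergence of zero and distributions with disjoint supports give ln 2.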

JSD was introduced in [10] and its properties are well reviewed 
in [13]. For our purposes the most important feature of the JSD 
is that the two distributions we want to compare do not have to 
be absolutely continuous with respect to each other, as is 
required for instance by the Kullback-Leibler divergence [?]. 
In fact we want to compare distributions of attributes that are 
not necessarily shared by all the populations of the system. 
Moreover the JSD embeds a weighting system for the different 
distributions, and it was demonstrated in [13] that the optimal 
choice for the weights is the statistical weight of the 
samples. This feature is necessary in order to compare 
populations that differ in size. Hence if the number of 
elements of the population defined by the distribution P is 
$n_1$ and the number of elements of the population defined by 
the distribution Q is $n_2$, we define 
$\pi_i = n_i/(n_1 + n_2)$. 

It has been demonstrated that the square root of the JSD 
defines a metric in the case of populations of the same size, 
$\pi_1 = \pi_2 = 1/2$, while for different population sizes the 
triangle inequality has not been demonstrated yet [?]. Moreover 
we have that $0 \le \mathrm{JSD}(P\|Q) \le -\pi_1 \ln \pi_1 - \pi_2 \ln \pi_2 \le \ln 2$, 
with $\mathrm{JSD}(P\|Q) = 0 \Leftrightarrow P = Q$ and 
$\mathrm{JSD}(P\|Q) = -\pi_1 \ln \pi_1 - \pi_2 \ln \pi_2$ if and only if 
P and Q have disjoint domains. 

JSD measures the information flow between two distributions in 
terms of their shared and non-shared elements. To understand 
the meaning of the JSD we can refer to the example of the two 
probability distributions P and Q defined in a certain 
attribute space shown in Fig. 1. P is defined on an attribute 
domain $D_P$, while Q is defined on an attribute domain $D_Q$. 
Let us call $X = D_P \cup D_Q$ and suppose that 
$J = D_P \cap D_Q \neq \emptyset$ is the joint attribute domain of the 
two distributions, while $D = X - J$ is the disjoint attribute 
domain of the distributions. Then Eq. (1) can be split over the 
two domains: 
$\mathrm{JSD}(P\|Q) = \mathrm{JSD}(P\|Q)_J + \mathrm{JSD}(P\|Q)_D$, where 
$\mathrm{JSD}(P\|Q)_D = H(\pi_1 P + \pi_2 Q)_D - \pi_1 H(P)_D - \pi_2 H(Q)_D = -\pi_1 \ln \pi_1 \sum_D p_i - \pi_2 \ln \pi_2 \sum_D q_i$. 
The contribution given to the JSD by the disjoint domain is 
thus a statistical measure quantifying the sizes of the 
non-shared parts of the attribute distributions. 

For the part of the joint domain we have that 
$\mathrm{JSD}(P\|Q)_J = -\sum_J (\pi_1 p_i + \pi_2 q_i) \ln(\pi_1 p_i + \pi_2 q_i) + \pi_1 \sum_J p_i \ln p_i + \pi_2 \sum_J q_i \ln q_i$. 
$\mathrm{JSD}(P\|Q)_J$ is the entropy of the weighted sum of the two 
distributions minus the weighted sum of the entropies of the 
distributions, measured in the shared part of the attribute 
domain. From an informational point of view we can say that if 
the sum of the distributions is broader than the single 
distributions, this results in a large value of the 
divergence. Otherwise, if the weighted sum of the distributions 
has a larger informative value, hence a smaller entropy than 
the one we have from the single distributions, then we obtain a 
small divergence from the shared part of the attribute domain. 
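
The split of Eq. (1) into a joint and a disjoint contribution can be checked numerically. The sketch below is illustrative only: it assumes distributions stored as dictionaries holding strictly positive probabilities and a weight 0 < pi_1 < 1, and `jsd_split` is a hypothetical helper name.

```python
import math

def jsd_split(p, q, pi_1):
    """Return the (joint, disjoint) contributions to the JSD of Eq. (1)."""
    pi_2 = 1.0 - pi_1
    joint_domain = set(p) & set(q)
    # Disjoint part: -pi_1 ln(pi_1) sum_D p_i - pi_2 ln(pi_2) sum_D q_i
    mass_p = sum(v for a, v in p.items() if a not in joint_domain)
    mass_q = sum(v for a, v in q.items() if a not in joint_domain)
    disjoint = -pi_1 * math.log(pi_1) * mass_p - pi_2 * math.log(pi_2) * mass_q
    # Joint part: mixture entropy minus weighted entropies, restricted to J
    joint = 0.0
    for a in joint_domain:
        m = pi_1 * p[a] + pi_2 * q[a]
        joint += (-m * math.log(m)
                  + pi_1 * p[a] * math.log(p[a])
                  + pi_2 * q[a] * math.log(q[a]))
    return joint, disjoint

# Toy distributions sharing only the attribute "b".
p = {"a": 0.5, "b": 0.5}
q = {"b": 0.25, "c": 0.75}
joint, disjoint = jsd_split(p, q, pi_1=0.4)
```

Adding the two returned values reproduces the divergence computed directly from Eq. (1) on the full domain.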
FIG. 1: An example of two probability distributions P(X) 
and Q(X) defined in an attribute domain X where a fraction 
J of the domain is shared by the two distributions. 

The only issue that arises when applying the JSD to a system 
composed of many populations is that its maximum value depends 
on the population sizes. This means that we can find cases 
where the JSD of two uncorrelated distributions is smaller than 
that of two correlated ones. To avoid this problem we introduce 
a new index D defined as the JSD normalised to its maximum 
value: 



$$D(P\|Q) = \frac{\mathrm{JSD}(P\|Q)}{-\pi_1 \ln \pi_1 - \pi_2 \ln \pi_2} \qquad (2)$$

$D(P\|Q)$ has the same properties as $\mathrm{JSD}(P\|Q)$, with the 
difference that $0 \le D(P\|Q) \le 1$, where 
$D(P\|Q) = 0 \Leftrightarrow P = Q$ and 
$D(P\|Q) = 1 \Leftrightarrow J = \emptyset$. 
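
A short sketch of the normalised index of Eq. (2) follows, using the statistical weights of the samples as the weights. The dictionary representation and the names `entropy` and `normalised_divergence` are assumptions made for illustration.

```python
import math

def entropy(dist):
    """Shannon entropy in nats of a dict {attribute: probability}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def normalised_divergence(p, q, n_p, n_q):
    """Index D of Eq. (2): the JSD divided by its size-dependent maximum."""
    pi_1 = n_p / (n_p + n_q)
    pi_2 = n_q / (n_p + n_q)
    support = set(p) | set(q)
    mixture = {a: pi_1 * p.get(a, 0.0) + pi_2 * q.get(a, 0.0) for a in support}
    jsd = entropy(mixture) - pi_1 * entropy(p) - pi_2 * entropy(q)
    jsd_max = -pi_1 * math.log(pi_1) - pi_2 * math.log(pi_2)
    return jsd / jsd_max

# Two disjoint populations of very different sizes still give D = 1.
d_disjoint = normalised_divergence({"a": 1.0}, {"b": 1.0}, n_p=10, n_q=90)
d_equal = normalised_divergence({"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5}, 10, 90)
```

The raw JSD of the disjoint pair above is well below ln 2 because the sizes are unbalanced, which is exactly the effect the normalisation removes.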



A. Directionality 

$D(P\|Q)$, like $\mathrm{JSD}(P\|Q)$, is a symmetric quantity in its 
arguments, that is $D(P\|Q) = D(Q\|P)$. Hence it does not give 
information about the directionality of the interaction. In 
order to infer a directionality for the information flow we 
borrow a rationale from sociology, in particular from the idea 
of geographical segregation. 

Geographical segregation is a concept that is widely used in 
many areas of science, such as sociology [9, 15], economics 
[3], geography [12], physics and biology [18]. It refers to the 
inequality between population attribute distributions inside a 
metapopulation. In particular, a population inside a 
metapopulation is said to be segregated with respect to some 
attributes if those attributes are found with a substantial 
probability in that population and are not found with a 
significant probability in the other populations of the 
system. 
FIG. 2: Following the notation of the example of Fig. 1, here 
we show the renormalised distributions of the shared elements, 
$P_J$ and $Q_J$, in the joint domain J. In this particular 
example it is evident that $P_J$ is a peaked distribution, while 
$Q_J$ is uniformly distributed. Hence the distribution $P_J$ is 
more segregated than $Q_J$. 

There are many indexes in the literature to measure 
geographical segregation [19]. A popular one is Theil's 
segregation index [20, 21], which is based on information 
theory. The Theil index is the difference between the total 
entropy of the system with respect to some attributes and the 
weighted sum of the entropies of the different populations, and 
it is defined as $T = H_T - \sum_i w_i H_i$, where $H_T$ is the 
total Shannon entropy of the system, $H_i$ is the entropy of 
population i and $w_i$ is its statistical weight. If T is close 
to 0 it means that those attributes are not segregated in the 
system, but are distributed more or less uniformly through it. 
If T is substantially larger than 0, it means that those 
attributes are segregated in one or more populations of the 
system. 
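
The Theil index can be sketched as follows. As an assumption of this example, the total entropy is computed on the size-weighted pooled distribution of all populations and the statistical weight of a population is its share of the total number of elements; `theil_index` is a hypothetical helper name.

```python
import math

def entropy(dist):
    """Shannon entropy in nats of a dict {attribute: probability}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def theil_index(populations):
    """Theil index T = H_T - sum_i w_i H_i.

    `populations` is a list of (distribution, size) pairs.
    """
    total = sum(size for _, size in populations)
    pooled = {}
    weighted_entropy = 0.0
    for dist, size in populations:
        w = size / total  # statistical weight of this population
        weighted_entropy += w * entropy(dist)
        for attribute, prob in dist.items():
            pooled[attribute] = pooled.get(attribute, 0.0) + w * prob
    return entropy(pooled) - weighted_entropy

mixed = theil_index([({"a": 0.5, "b": 0.5}, 10), ({"a": 0.5, "b": 0.5}, 30)])
segregated = theil_index([({"a": 1.0}, 20), ({"b": 1.0}, 20)])
```

Two populations with the same attribute mix give T close to zero, while two equally sized populations with no attributes in common give T = ln 2.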

The Shannon entropy is a well defined measure to estimate the 
amount of inequality represented by a probability 
distribution. It is large when the attribute frequency 
distribution is uniform and it increases with population size. 
In our case a large entropy for an attribute ensemble 
represents the fact that different attributes are equally mixed 
and it is a hint of small segregation in the attribute space. 
Conversely, a small value of the Shannon entropy is associated 
with a large inequality between attribute frequencies and with 
a small number of different attributes, and it is evidence of 
segregation for the population in the attribute space, where 
exchanges with other populations are few. Then, in general 
terms, if an information flow is detected between two 
populations we can argue that the origin of the information is 
in the most segregated distribution, where the Shannon entropy 
is smaller (see Fig. 2). 

Hence, given two distributions P and Q between which an 
information flow is detected, to infer the directionality of 
the flow we first consider the inequality of the two 
distributions in the joint domain. To do that we consider the 
distributions $P_J$ and $Q_J$, that is, the distributions of the 
elements of P and Q that belong to the joint domain J, with 
their frequencies renormalised to unity in J (see Fig. 2). 

The number of attributes shared by two distributions is the 
same for both distributions, hence the entropy H measured over 
the joint domain J depends only on the relative frequencies of 
the attributes. In particular, more peaked distributions have 
smaller entropy than broader distributions. We then have to 
take into account the fact that the population sizes are 
different. In particular it is important to understand what 
fraction of the whole population the shared elements 
represent. To do that we define the index 
$\mu_P = \sum_{i \in J} p_i$ (and analogously $\mu_Q$ for Q), and we 
have $0 \le \mu_{P,Q} \le 1$. If for a certain distribution $\mu$ is 
close to one, it means that the shared attributes are the 
dominant part of that sample. Then an estimator for the 
information flow directionality between P and Q can be defined 
as 



$$I(P \to Q) = -\mathrm{sign}\left[\frac{H(P_J)}{\mu_P} - \frac{H(Q_J)}{\mu_Q}\right] \qquad (3)$$



If $I(P \to Q) = +1$ the information carried by the attributes in 
the joint domain of P is larger than the information carried by 
the attributes in the joint domain of Q. Then we can infer an 
information flow from the attribute distribution P to Q. 
Otherwise, if $I(P \to Q) = -1$, we can infer an information flow 
from the attribute distribution Q to P. 
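
The directionality estimator can be sketched as below, assuming that Eq. (3) reads I(P → Q) = −sign[H(P_J)/μ_P − H(Q_J)/μ_Q], with μ the fraction of a distribution's probability mass that falls in the joint domain J. The function names, the dictionary representation and the convention of returning 0 on an exact tie are assumptions of this example.

```python
import math

def entropy(dist):
    """Shannon entropy in nats of a dict {attribute: probability}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def restrict_to_joint(dist, joint):
    """Return (distribution renormalised on J, shared mass mu)."""
    mu = sum(dist[a] for a in joint)
    return {a: dist[a] / mu for a in joint}, mu

def direction(p, q):
    """+1 for a flow from P to Q, -1 for Q to P, 0 on an exact tie."""
    joint = set(p) & set(q)
    if not joint:
        raise ValueError("no shared attributes, no flow can be inferred")
    p_j, mu_p = restrict_to_joint(p, joint)
    q_j, mu_q = restrict_to_joint(q, joint)
    diff = entropy(p_j) / mu_p - entropy(q_j) / mu_q
    if diff == 0:
        return 0
    return 1 if diff < 0 else -1

peaked = {"a": 0.9, "b": 0.1}    # more segregated on the shared attributes
uniform = {"a": 0.5, "b": 0.5}   # evenly mixed on the shared attributes
flow = direction(peaked, uniform)
```

In this toy case the peaked distribution has the smaller entropy on the joint domain, so it is identified as the origin of the flow.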



III. GENETIC FLOW BETWEEN SEAGRASS 
MEADOWS 

In this section we build the genetic flow network between 
subpopulations of Posidonia oceanica (PO hereafter). PO is a 
seagrass that is endemic to the Mediterranean Sea [22]. It can 
reproduce either sexually, via floating fruits, or asexually, 
by spreading stems, though the latter way is the most common 
one. PO is a key species for the Mediterranean ecosystem, since 
its large colonies give shelter to many other species. 
Generating a network of directed genetic flows between the 
meadows helps to understand how this species grew and populated 
the Mediterranean Sea. 

The dataset is composed of a set of N = 37 meadows of PO, 
geographically distributed in N points. For every meadow, or 
population, a variable number M = 40 ± 5 of ramets, or 
individuals, was genotyped in terms of n = 7 microsatellite 
markers [22]. 

Microsatellite markers are tandem nucleotide repeats that are 
present in the non-coding region of DNA [23]. Their function is 
not understood yet, but their regularity makes them optimal 
markers to identify individuals. In fact it is via 
microsatellite markers that DNA is investigated in forensic 
trials. The same dataset was already analysed in [?, 24] with 
different genetic distances. Since each allele of a 
microsatellite marker is characterized by the number of 
repetitions of a specific DNA motif occurring at that 
microsatellite locus, each individual or ramet belonging to a 
given population is characterized by a set of n = 7 pairs of 
integer numbers. 

An example of such a ramet is [(151, 161), (164, 164), 
(210, 210), (234, 238), (159, 171), (178, 178), (178, 180)]. 
The allele number for each locus is expressed as a pair 
because PO is a diploid organism, that is its DNA is 
made of two complete sets of chromosomes, so that each 
number belongs to a given chromosome. However we 
don't know the exact order for the numbers of each pair, 
that is we don't know which are the numbers belonging 
to a given chromosome and the numbers belonging to the 
other one. For this reason the allele repetitions for each 
locus are ordered by their size. 

The classical way to treat this kind of data is to consider the 
ensemble of alleles at each locus and then to average over the 
loci [25]. Nevertheless this approach discards the correlations 
between the alleles belonging to different loci. 

To avoid this problem we represent each ramet in a 
7-dimensional space, $\mathbb{N}^7$, where each dimension is a 
specific locus. Then for each ramet we consider all the 
possible combinations of all the pairs of alleles in the 7 
loci. In this way we obtain for each ramet $2^7 = 128$ points in 
the loci space, each point representing an equiprobable gamete 
representative of the ramet. 

As an example, if we have a ramet with two diploid loci, 
(125,127) and (400,404), then we can represent that ramet with 
$2^2 = 4$ points in a 2-dimensional space: (125,400), (125,404), 
(127,400), (127,404). In this way each population is 
represented by a set of 5078 ± 597 points in $\mathbb{N}^7$, 
which is the statistical sample characterizing the probability 
distribution function in that space. Moreover every homozygous 
locus gives rise to two equal points in $\mathbb{N}^7$, which 
gives statistical strength to homozygosity in the resulting 
density distribution for the population. Thus we obtain 187904 
representative points for the 37 populations. 
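
The expansion of a ramet into its gamete points is a Cartesian product over the loci. The following is a minimal sketch; `ramet_to_gametes` is a hypothetical helper name, and the 7-locus ramet is the example quoted earlier in this section.

```python
from itertools import product

def ramet_to_gametes(ramet):
    """Expand a diploid ramet (a list of allele pairs) into 2**n gamete points."""
    return list(product(*ramet))

# The two-locus example from the text.
two_loci = ramet_to_gametes([(125, 127), (400, 404)])

# The 7-locus ramet quoted earlier in this section.
seven_loci = ramet_to_gametes([(151, 161), (164, 164), (210, 210), (234, 238),
                               (159, 171), (178, 178), (178, 180)])
```

Homozygous loci produce repeated points, so the 128 points of the 7-locus ramet are not all distinct; keeping the duplicates is what gives homozygosity its statistical weight.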

Each meadow is completely specified by its size and by its 
probability distribution in the 7-dimensional loci space, each 
point of the distribution giving us the probability that a 
certain gamete is present in a given meadow. To generate the 
network of genetic flow between the meadows we apply Eq. (2) 
and Eq. (3) to the processed dataset. 

The measurement of the directional genetic flow between meadows 
gives us a list of all the possible pairs of meadows, each 
separated by a directional genetic distance. We then order the 
meadow pairs by increasing values of their genetic distance, 
and we define a network of meadows by considering two meadows 
as linked when their genetic distance is smaller than a given 
threshold. 

When increasing the value of the threshold we obtain a growing 
network where the first links to form are the strongest in a 
genetic sense. We can analyse the network at different 
thresholds to see how the different clusters form and merge. A 
significant threshold at which to analyse the network is the 
percolation threshold (PT hereafter), at which the main 
clusters of the network connect [6, 26]. 
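
The thresholding step can be sketched with a small union-find routine that links every pair whose distance is below the threshold and counts the resulting clusters. This is only an illustration: the function name, the input format and the toy distances are assumptions, and the sketch does not itself locate the percolation threshold, which would require scanning thresholds and monitoring how the clusters merge.

```python
def network_at_threshold(n_vertices, distances, threshold):
    """Link pairs with distance < threshold; return (links, n_components).

    `distances` maps a pair (i, j) of vertex indices to its distance D.
    """
    parent = list(range(n_vertices))

    def find(i):
        # Follow parent pointers to the cluster root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    links = []
    # Strongest links (smallest distances) are added first.
    for (i, j), d in sorted(distances.items(), key=lambda item: item[1]):
        if d >= threshold:
            break
        links.append((i, j))
        parent[find(i)] = find(j)
    n_components = len({find(i) for i in range(n_vertices)})
    return links, n_components

# Hypothetical distances between four populations.
toy = {(0, 1): 0.2, (2, 3): 0.3, (1, 2): 0.6,
       (0, 3): 0.9, (0, 2): 0.95, (1, 3): 0.97}
low_links, low_clusters = network_at_threshold(4, toy, 0.5)
high_links, high_clusters = network_at_threshold(4, toy, 0.7)
```

In the toy data two separate clusters exist at threshold 0.5 and they merge into one once the link of distance 0.6 is admitted.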

In Fig. 3 we show the resulting network at its percolation 
threshold. The network is displayed by a classical spring 
embedding algorithm [27] that shows the emerging clusters. No 
geographical data are used to draw the network, and the 
geographical map in the background is given only to convey an 
idea of the geographical spaces that are involved. Different 
colors are given to meadows belonging to different geographical 
areas. As we can see, the algorithms presented in Sec. II 
efficiently split the geographical clusters of Spain, Sicily 
and Greece. In particular the genetic channel between the East 
Mediterranean and the West Mediterranean Sea is well recognised 
through the link between a Greek meadow and a Sicilian one. 
Moreover the detected direction of this latter genetic flow is 
in agreement with evolutionary hypotheses for the spread of PO 
in the Mediterranean Sea [18]. 



IV. SEMANTIC FLOW BETWEEN WIKIPEDIA 
PAGES 

In this section we show an application of the method presented 
in Sec. II to the detection of semantic flows between different 
written texts. In what follows we measure the semantic flow 
between a set of selected Wikipedia pages. In this paper we are 
mostly interested in showing the reliability of the method, but 
its application to semantics can lead to interesting research 
on automatic semantic classification in digital media, or on 
human machine interfaces [?]. 

We consider 78 entries of Wikipedia, selected from 14 different 
categories, and we calculate the directional semantic flows 
between each pair of pages. First of all we process the text to 
get rid of all its structural tokens, such as articles, 
punctuation, and the most common adverbs and adjectives. After 
that we lemmatise the text, that is we reduce all the verbs to 
their infinitive forms and all the plural words to singular. 

FIG. 3: (Color online) Directed genetic flow network of PO meadows in the Mediterranean Sea. The network is displayed via a 
classical algorithm of spring embedding at its percolation threshold, without additional geographical information. Nevertheless 
the genetic clusters efficiently reflect the geographic locations of the meadows and the directions of the genetic flows agree 
with standard evolutionary theories for the PO. 

Then we consider each page as a population where the elements 
are the different processed words forming a 1-dimensional 
attribute space. Hence each page is defined by its content word 
distribution and by its size. 
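
A rough sketch of how a page becomes a population of words is given below. It is deliberately simplified: the stop-word list is a tiny hypothetical stand-in for the removal of structural tokens, and the lemmatisation step is omitted because it would need a dedicated linguistic tool.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list used only for illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "very"}

def page_to_population(text):
    """Return (word distribution, size) for the content words of a text."""
    words = re.findall(r"[a-z]+", text.lower())   # drops punctuation and digits
    content = [w for w in words if w not in STOP_WORDS]
    size = len(content)
    counts = Counter(content)
    return {w: c / size for w, c in counts.items()}, size

dist, size = page_to_population("The cat and the dog. The cat!")
```

The returned pair plays the same role as a meadow's gamete distribution and size in the genetic application.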

As in Sec. III we apply Eq. (2) and Eq. (3) to our dataset, 
thus obtaining a list of all the pairs of pages, each separated 
by a semantic distance. We order the list by increasing values 
of D so that we can analyse the network at different 
thresholds. Again an interesting threshold is the percolation 
threshold, since it shows the active semantic channels between 
very different areas of knowledge. 

The resulting network is displayed in Fig. 4. Again the 
introduced algorithm is able to efficiently split the pages 
belonging to different categories into different clusters. 
Moreover we can see how the semantic flows can delimit 
different areas of knowledge. 

Many semantic aspects emerge from this analysis. For instance 
it is interesting to notice how the semantics used to describe 
movie directors is interlaced with the one used for literary 
writers. The semantics describing Karl Marx forms a bridge 
between the one used for philosophy and the one used for 
socialism. The economics semantics is common to all the 
political entries, while the financial one is specific to 
capitalism. The different countries have a semantic description 
that lies between that of politics and that of religions, and 
philosophy forms a semantic channel between politics and 
science. 



V. CONCLUSIONS 

In this research we introduced a method to measure directional 
information flow between different populations belonging to a 
given system. The definition of such a system is very general, 
so that the applicability of the method is wide, even if the 
relation with metapopulation dynamics is evident [?]. In 
particular the elements of the system can be described by a 
multidimensional vector of either numeric or symbolic 
attributes, and the method takes into account the different 
population sizes. 

The improvement of this methodology over the classical ones 
used to compare probability distributions [10] is that it is 
designed for a many-population system, giving the chance to 
build a network of information flow. Moreover the application 
of ideas coming from geographical segregation studies allows us 
to address the question of directionality in the interaction, 
transforming the static idea of correlation or divergence 
between probability distributions into a dynamical idea of 
information flows between subsystems of a given macrosystem. 

We showed two simple applications of the method to different 
scientific fields, namely genetics and semantics. In the first 
case we showed how the method can recognise the geographical 
locations of seagrass meadows via microsatellite markers. In 
the latter case we showed how the method can easily map a 
portion of the semantic space via the analysis of word 
distributions in Wikipedia entries. 







FIG. 4: (Color online) Semantic flow network between 78 Wikipedia pages selected within 14 different categories. The network 
is displayed at the percolation threshold via an automatic spring embedding technique. The clusters efficiently split the different 
semantic areas. 



The topic of genetic distance is wide and it is possible to 
find many different approaches to the problem in the literature 
[29]. Reviewing them in comparison with our measure is beyond 
the scope of this paper; the interested reader can refer to 
[30] for a more detailed analysis. For now we can say in very 
general terms that the major improvements that our method 
brings with respect to other genetic distances are, first of 
all, that our measure accounts for different population sizes 
and infers a directionality for the interaction. Another 
novelty of the approach is that it considers the 
multidimensional gamete space, instead of the allele abundance 
averaged over the different loci. Considering the correlations 
between alleles in different loci thus narrows the analysis to 
a more recent evolutionary time-scale. 



The analysis of the whole semantic space, as represented by the 
network of semantic flows between the entries of the whole 
Wikipedia, has non-trivial properties and is presented in 
[31]. 

We think that apart from the mentioned case studies 
the presented methodology can have important applica- 
tions in fields such as sociology, sociogeography and eco- 
nomics. 